ABSTRACT

A natural way to communicate with C2 systems would be to use natural language. Natural language
components are already used in military systems, e.g. CommandTalk, a spoken-language interface to the
ModSAF battlefield simulator. In our project SOKRATES we use shallow parsing techniques for written
language to analyze German free-form battlefield reports. These reports are processed by transducers.
The extraction result is formalized in feature structures and semantically enriched by the semantic analysis;
the augmented result is then stored in the ATCCIS database. Once the result is stored, database triggers
initiate a change in the position of a tactical symbol on the tactical map. Shallow parsing techniques are the
basis for information extraction. In this paper, we therefore first introduce the promising field of information
extraction. Then, we describe in detail how shallow parsing techniques are used in our project SOKRATES.


INTRODUCTION 

In the NATO technical report Potentials of Speech and Language Technology Systems for Military Use: an
Application and Technology Oriented Survey (see [Steeneken, 1996]) the processing of human language was
recognized as a critical capability in many future military applications, among them ‘command and
control’. Although still a topic of research, natural language components already exist in military systems,
e.g. CommandTalk, a spoken-language interface to the ModSAF battlefield simulator (see [Moore et al,
1997], [Stent et al, 1999], [CommandTalk]), or the Phraselator (see [Phraselator, 2003]) used by the US Army
in Afghanistan.

Today, the usability of human language technology is restricted to narrow and well-defined application areas
(domains). The language itself must be restricted as well: the vocabulary and the grammatical structures
must be limited enough that processing time becomes acceptable. The military domain and the stereotyped
military command language seem to be well suited to language technology.

A natural way to input information into a C2 system would be to use written or spoken natural language. In this
context, we tried to show in an earlier approach that the available methods, techniques, and tools of
computational linguistics are mature enough to test their applicability to C2 systems. In this earlier approach
we used the ATCCIS database (cf [NATO, 2000]) as the domain. We have already reported on the progress
of our earlier project using a speech recognizer (cf [Recking, 2001]). Furthermore, we showed (cf [Recking,
2002]) how to use deep syntactic analysis techniques to analyze simple written sentences. This approach,
however, has various deficiencies. In particular, the high demand on processing time and the expectation that
all sentences are well formed with respect to the grammar hamper the application of these techniques.
Therefore, we looked for an alternative approach that avoids these deficiencies.

Information extraction (IE) is an engineering approach, based on results of computational linguistics, to building
systems that process large amounts of text of a specific sort. IE avoids the deficiencies mentioned above.
Each IE system is tailored to a specific domain and task. IE uses a shallow syntactic approach, i.e. only
parts of the sentences (so-called ‘chunks’) are processed with finite-state automata or transducers. These
parts contain the relevant information about the Who, What, When, etc. To realize an IE system, the desired
output must be specified. This is done through templates, which represent feature-value structures. During
the IE process, a domain-specific lexicon and domain-specific rules are used to instantiate the templates.

In this paper, we will first give a short introduction to the promising field of IE. Then, we show how we use
the SMES system in our project SOKRATES to realize an IE system for battlefield reports in German. We
will describe the various steps of the IE process and explain in detail the shallow parsing techniques used.


INFORMATION  EXTRACTION 

Information extraction (IE) is the task of identifying, collecting and normalizing information from natural
language text (see [Appelt, 1999], [Pazienza, 1999]). The goal is to find the relevant information about the
Who, What, When, etc. The information of interest is described through patterns called templates. During the
IE task these templates are filled with the collected information. IE can therefore be seen as the process of
normalizing free-form text into a defined structure. The templates are domain and task specific, i.e. they
must be newly created for each new task and domain.

To realize an IE system, language resources (lexicon, grammar) and appropriate parsing software are
necessary. This software must be language-specific. IE tools for English, for example, are not
appropriate for analyzing German text because of the free word order of German.

In order to achieve robust and efficient IE systems, domain knowledge must be integrated and shallow
algorithms must be used. The domain knowledge is tightly integrated with the language knowledge, e.g. the
name ‘Leopard’ in the lexicon has the categorical information ‘tank’. This association between words and
semantic information is domain-specific and has to be changed for other applications.

Current IE technology is used successfully in various application areas, e.g. intelligent information
retrieval, linguistically based data mining, automatic term extraction, and text classification.

The IE process itself is divided into substeps. After tokenizing the text, the sentence boundaries must be
identified. Then, the morphological component identifies the word stems and abbreviations and detects the
syntactic information (e.g. grammatical case and gender). After this, chunk parsing with transducers selects
the parts of the natural language text that are relevant for the specific information extraction task. The chunks
are then used to instantiate the templates, which represent the result of the IE process.
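
The substeps above can be sketched as a toy pipeline. This is a minimal Python illustration under stated assumptions, not the SMES implementation: the lexicon entries, the function names and the template slots are all hypothetical, and the chunking rules are reduced to two hard-coded patterns.

```python
import re

# Hypothetical toy lexicon: surface form -> (stem, part of speech, domain category)
LEXICON = {
    "fahrzeuge": ("fahrzeug", "noun", "vehicle"),
    "marschieren": ("marschieren", "verb", "move"),
    "nach": ("nach", "prep", None),
}

def tokenize(text):
    """Step 1: split the text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(tokens):
    """Step 2: naive sentence boundary detection on end-of-sentence punctuation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def analyze(tokens):
    """Step 3: look up stem, part of speech and category for each token."""
    analyzed = []
    for tok in tokens:
        if tok.isdigit():
            analyzed.append((tok, "number", None))
        else:
            analyzed.append(LEXICON.get(tok.lower(), (tok.lower(), "unknown", None)))
    return analyzed

def fill_template(analyzed):
    """Steps 4 and 5: pick out the relevant chunks and instantiate a template."""
    template = {"type": "???", "count": "???", "theme": "???", "goal": "???"}
    for i, (stem, pos, cat) in enumerate(analyzed):
        nxt = analyzed[i + 1] if i + 1 < len(analyzed) else None
        if pos == "verb" and cat:
            template["type"] = cat                 # action chunk
        elif pos == "number" and nxt and nxt[1] == "noun":
            template["count"] = int(stem)          # theme chunk: number + noun
            template["theme"] = nxt[2]
        elif stem == "nach" and nxt and nxt[1] == "unknown":
            template["goal"] = nxt[0]              # goal chunk: "nach" + name
    return template

def extract(text):
    """Run all substeps and return one filled template per sentence."""
    return [fill_template(analyze(s)) for s in split_sentences(tokenize(text))]

templates = extract("18 Fahrzeuge marschieren nach Ebstorf.")
```

Slots that no chunk fills keep the placeholder "???", so the template always has the same shape regardless of how much of the sentence was recognized.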

Various toolboxes are available for building IE systems. These toolboxes must be language specific. A powerful
IE toolbox for German is the SMES system (cf [Neumann, 2003]). Among other things, this toolbox offers a
morphological analysis component with a large lexicon, predefined grammars (transducers) for specific
phrases (e.g. noun phrases) and the possibility to program arbitrary transducers.


THE  PROJECT  SOKRATES 

The overall objective of the SOKRATES project is to analyze written German battlefield reports. The result of
the analysis is stored in the ATCCIS database (see [NATO, 2000]). These stored results can be used for
different purposes. One purpose is that location changes of units automatically trigger changes of tactical
symbols on the tactical map.

The  Architecture 

The architecture is shown in Fig. 1. The free-form reports are handed over to the coordination module, which
is responsible for all coordination in the system. In a first step, the syntactic preprocessing identifies the
sentence boundaries. Next, the information extraction module uses the lexicon and the grammar transducers to
identify and select the relevant parts of the natural language text. These parts are represented as typed feature
structures that are coded as an XML document. The result of the information extraction is used by the
semantic analysis component to deduce further information from the extracted information with the help of an
ontology and the context (see [Schade, 2003a], [Schade, 2003b]). After the semantic analysis the result is
pushed into the ATCCIS database, where it is then used to automatically alter the position of tactical symbols
on the map.
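
How a feature structure might be coded as an XML document can be sketched in a few lines of Python. The element names feature-structure, feature and atomic are the ones that appear in the project's XML examples in the accompanying slides; the helper to_xml and the use of a nested dict as input are assumptions made for illustration, and list values and references are not handled.

```python
import xml.etree.ElementTree as ET

def to_xml(fs):
    """Encode a nested dict as feature-structure / feature / atomic elements."""
    node = ET.Element("feature-structure")
    for name, value in fs.items():
        feature = ET.SubElement(node, "feature", name=name)
        if isinstance(value, dict):
            # A complex value is itself a feature structure.
            feature.append(to_xml(value))
        else:
            # Atomic values are stored as text.
            ET.SubElement(feature, "atomic").text = str(value)
    return node

location = {"LOCATION": {"NAME": "vinstedt", "TYPE": "LOCATION"}}
xml_text = ET.tostring(to_xml(location), encoding="unicode")
```

Because nesting in the dict maps directly to nesting in the XML, the receiving module can rebuild the feature structure with an ordinary XML parser.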

The  Information  Extraction  Module 

During information extraction the structure of a text must be determined. This requires a grammar with a
lexicon and a parser. There are many different grammar formalisms for natural language processing. To be
able to process large amounts of text, an efficient approach must be used. Therefore, we use a shallow parsing
approach, i.e. only parts of the sentences (so-called ‘chunks’) are processed, and this is done by transducers.
These transducers encode the necessary grammar.

The result of the syntactic analysis is represented in templates. The necessary templates are formalized in
SOKRATES by typed feature structures (see [Pollard and Sag, 1994]). These structures consist of name-value
pairs. In the simplest case, the feature values can be strings, numbers or other atomic values. But the
values can also be whole feature structures. In the following example

(:TIME
  (:SECOND . "???")
  (:MINUTE . 35)
  (:HOUR . 10)
  (:DAY . 9)
  (:MONTH . "September")
  (:YEAR . "???")
  (:TIMEZONE . "???")
  (:TYPE . :TIME)
)

the feature with name :TIME of type 'time' ((:TYPE . :TIME)) has a feature structure ((:SECOND . "???") ...
(:TIMEZONE . "???")) as its value. In contrast, the feature with name :HOUR has an atomic value (10).
Unknown values are represented by "???". Feature values can be accessed through paths
(e.g. :TIME|:MINUTE gives the value 35).
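
Path access can be illustrated with a small Python sketch in which nested dicts stand in for the Lisp structures. The function name get_path is hypothetical, and returning "???" for a path that does not exist is an assumption made here for convenience, not documented SOKRATES behavior.

```python
UNKNOWN = "???"

# The :TIME example from the text, written as nested dicts.
report = {
    ":TIME": {
        ":SECOND": UNKNOWN,
        ":MINUTE": 35,
        ":HOUR": 10,
        ":DAY": 9,
        ":MONTH": "September",
        ":YEAR": UNKNOWN,
        ":TIMEZONE": UNKNOWN,
        ":TYPE": ":TIME",
    }
}

def get_path(fs, path):
    """Follow a '|'-separated path of feature names through nested structures."""
    value = fs
    for name in path.split("|"):
        if not isinstance(value, dict) or name not in value:
            return UNKNOWN  # assumption: a missing path is treated like an unknown value
        value = value[name]
    return value

minute = get_path(report, ":TIME|:MINUTE")
```

Here get_path(report, ":TIME|:MINUTE") follows :TIME into the nested structure and then reads the atomic :MINUTE value.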

The feature structures used form an inheritance hierarchy (see Fig. 2). This hierarchy completely describes all
possible structures that the IE module can use and might instantiate. For example, in Fig. 2 the type 'template'
has two subtypes 'report' and 'order'. The report-type feature structure has various name-value pairs,
e.g. ':addressee partner'. The value partner is itself a feature structure.

The information extraction process itself is realized with shallow algorithms, called transducers.
Transducers are finite-state automata that read from an input stream and write to an output stream. These
automata can be cascaded, i.e. the output of one transducer is the input to another. For example,
the names of various locations are the input to a transducer that constructs a feature structure
formalizing the recognized name of the location. This feature structure is then handed as input to the
transducer responsible for detecting, e.g., goal expressions. During the syntactic analysis of the free-form
reports with transducers, more and more complex feature structures are constructed.
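
The cascade described above can be sketched with two Python generators, each reading a stream and writing a stream. This is only an illustration of the cascading idea: the gazetteer KNOWN_PLACES, the function names and the single "nach" rule are assumptions, and real SMES transducers are compiled from regular-expression specifications rather than written by hand.

```python
KNOWN_PLACES = {"ebstorf", "eppensen", "vinstedt"}  # hypothetical gazetteer

def location_transducer(stream):
    """Stage 1: rewrite known place names as location feature structures."""
    for item in stream:
        if isinstance(item, str) and item.lower() in KNOWN_PLACES:
            yield {":NAME": item.lower(), ":TYPE": ":LOCATION"}
        else:
            yield item

def goal_transducer(stream):
    """Stage 2: rewrite 'nach' followed by a location as a goal structure."""
    items = list(stream)
    i = 0
    while i < len(items):
        nxt = items[i + 1] if i + 1 < len(items) else None
        if items[i] == "nach" and isinstance(nxt, dict) and nxt.get(":TYPE") == ":LOCATION":
            # Consume both inputs and emit one, more complex, structure.
            yield {":GOAL": {":QUALIFIER": ":TO", **nxt}}
            i += 2
        else:
            yield items[i]
            i += 1

tokens = ["18", "Fahrzeuge", "marschieren", "nach", "Ebstorf"]
# The output stream of the first transducer is the input stream of the second.
result = list(goal_transducer(location_transducer(tokens)))
```

The second stage never sees the raw place name, only the structure built by the first stage, which is what lets each transducer stay small.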

In SOKRATES we use the SMES system (see [Neumann, 2003]) as our IE tool. This system contains a large
German lexicon and offers the possibility to create transducers. SMES is implemented in Allegro Common
Lisp (see [ACL, 2002]) and therefore also offers the whole functionality of Common Lisp.
The following example shows a simple transducer for detecting who reports, when and where:

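(compile-regexp
 '(:conc
   (:current-pos start)
   ;; calls to other transducers; each result is bound to a variable
   (:seek so-date-time date)                  ; date and time of the report
   (:seek so-meldender meldender)             ; "Meldender" = reporting unit or person
   (:seek so-standort-meldender standort)     ; "Standort" = location of the reporter
   ;; call to the morphological component
   (:morphix-punctuation ":")
   (:current-pos end)
  )
 :debug *debug*
 :register-types '((:register start date meldender standort end))
 :output-desc
 '(:lisp (multi-aeons :speaker meldender :time date :location standort))
 :prefix T
 :suffix nil
 :name 'so-meldung-prolog                     ; "Meldung" = report
 :compile *compile*
 :write-to-file *trace-file*
)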

The name of this transducer is 'so-meldung-prolog' (:name 'so-meldung-prolog). It consists of a concatenation
(:conc) of calls to other transducers (e.g. (:seek so-date-time date)) and a call to the morphological component
((:morphix-punctuation ":")). If the called transducers successfully recognize the appropriate parts of the
report, they construct a feature structure describing their recognition result. This result is then handed over to
the calling transducer, e.g. in the variable date of the calling statement (:seek so-date-time date). The
transducer shown uses the recognition results to construct its own feature structure (:output-desc '(:lisp
(multi-aeons :speaker meldender :time date :location standort))), which is then passed up to its own calling
transducer.

Figure 1: The Architecture of SOKRATES.


feature structure

    theme
        :type        ":theme"
        :objects     {object}

    report    order    attack    move

    report
        :type        ":report"
        :addressee   partner
        :medium      medium
        :message     {action}
        :speaker     partner
        :time        time
        :location    location
        :credibility credibility

    move
        :type        ":move"
        :theme       theme
        :source      location
        :path        {location}
        :goal        location
        :area        location
        :distance    number
        :speed       number
        :carrier     object
        :start-time  time
        :duration    number

Figure 2: A Part of the Feature Structure Hierarchy.


In the current implementation the SOKRATES system is able to process simple reports about moving objects.
One example is the following report: "09. September 10.35 Uhr von 6./PzMrs332-Zug B in Vinstedt:
18 Fahrzeuge marschieren bei Straßenkreuzung Kr3 nördlich Eppensen nach Ebstorf."
(September 9th, 10.35 a.m. from 6./PzMrs332-Zug B in Vinstedt: 18 vehicles march at road crossing Kr3 north
of Eppensen to Ebstorf). After processing this report the result is represented as the following typed feature
structure:

(
  (:CREDIBILITY . "???")
  (:LOCATION
    (:NAME . "vinstedt")
    (:TYPE . :LOCATION))
  (:TIME
    (:SECOND . "???")
    (:MINUTE . 35)
    (:HOUR . 10)
    (:DAY . 9)
    (:MONTH . "September")
    (:YEAR . "???")
    (:TIMEZONE . "???")
    (:TYPE . :TIME))
  (:SPEAKER
    (:NAME . "6/pzmrs/332/zug/b")
    (:TYPE . :UNIT))
  (:MESSAGE
    (:SET
      (
        (:DURATION . "???")
        (:START-TIME . "???")
        (:CARRIER . "???")
        (:SPEED . "???")
        (:DISTANCE . "???")
        (:AREA . "???")
        (:GOAL
          (:QUALIFIER . :TO)
          (:NAME . "ebstorf")
          (:TYPE . :LOCATION))
        (:PATH . "???")
        (:SOURCE
          (:QUALIFIERS
            (:SET
              (
                (:QUALIFIER . "nördlich")
                (:NAME . "eppensen")
                (:TYPE . :LOCATION))))
          (:COORDINATES . "kr3")
          (:QUALIFIER . :EXACTLY-AT)
          (:NAME . "Straßenkreuzung")
          (:TYPE . :LOCATION))
        (:THEME
          (:OBJECTS
            (:SET
              (
                (:COUNT . 18)
                (:NAME . "Fahrzeuge")
                (:TYPE . :VEHICLE))))
          (:TYPE . :THEME))
        (:TYPE . :MOVE))))
  (:MEDIUM . :LETTER)
  (:ADDRESSEE . "???")
  (:TYPE . :REPORT)
)

The above feature structure is of type 'report'. Each report may contain information about the addressee of
the report (:ADDRESSEE), the medium in which it was formulated (in this case :LETTER), the message itself
(:MESSAGE), the unit or person who sends the report (:SPEAKER), the time of reporting (:TIME),
the location of the unit or person who reports (:LOCATION) and the credibility of the unit or person
(:CREDIBILITY). If the IE module does not find the appropriate information, the string "???" is delivered.
The message contains a set (:SET) of action descriptions. In the example it is a move-action (:MOVE). Words
like "marschieren", "fahren", "schwimmen", "fliegen" (to march, to drive, to swim, to fly) are mapped to the
move-action. The description of the move-action contains various features: the duration (:DURATION),
the start-time (:START-TIME), the carrier (:CARRIER), the speed (:SPEED), the distance (:DISTANCE),
the area (:AREA), the goal (:GOAL), the path (:PATH), the source (:SOURCE) and the theme (:THEME) of
the action. The goal of the move is described with the help of a feature structure of type :LOCATION.
In our example it formalizes the city Ebstorf as the goal of the march. The starting point of the move-action is
given after the feature name :SOURCE. This is also a feature structure of type :LOCATION. It gives the
coordinates ((:COORDINATES . "kr3")) of the crossing ((:NAME . "Straßenkreuzung")) and also a
qualifying statement that the crossing is north ((:QUALIFIER . "nördlich")) of the city Eppensen
((:NAME . "eppensen")). The theme describes which objects (:OBJECTS) are moving. In our example
18 vehicles ((:COUNT . 18)(:NAME . "Fahrzeuge")(:TYPE . :VEHICLE)) are moving.
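
The mapping from verbs to the move-action and the use of "???" for missing information can be sketched as follows. This is a hedged Python illustration: the dictionary VERB_TO_ACTION, the function new_move_action and the dict representation are assumptions, and only the four verbs and the feature names listed in the text are taken from the source.

```python
UNKNOWN = "???"

# Feature names of the move-action as listed in the text.
MOVE_FEATURES = [
    ":DURATION", ":START-TIME", ":CARRIER", ":SPEED", ":DISTANCE",
    ":AREA", ":GOAL", ":PATH", ":SOURCE", ":THEME",
]

# Hypothetical lexicon fragment: verb stem -> action type.
VERB_TO_ACTION = {
    "marschieren": ":MOVE",
    "fahren": ":MOVE",
    "schwimmen": ":MOVE",
    "fliegen": ":MOVE",
}

def new_move_action(verb_stem, known=None):
    """Build a move-action; every feature that was not extracted stays '???'."""
    if VERB_TO_ACTION.get(verb_stem) != ":MOVE":
        return None  # verb is not mapped to the move-action
    action = {name: UNKNOWN for name in MOVE_FEATURES}
    action.update(known or {})
    action[":TYPE"] = ":MOVE"
    return action

goal = {":QUALIFIER": ":TO", ":NAME": "ebstorf", ":TYPE": ":LOCATION"}
action = new_move_action("marschieren", {":GOAL": goal})
```

Starting from a fully populated structure of unknowns means later processing steps can rely on every feature being present, whether or not the report mentioned it.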

The SOKRATES system is implemented and is able to process simple examples like the one shown above. Up to
now, the lexicon has only been extended with a few military-specific words (but it already contains more than
120,000 German word stems). The next steps will be to extend each module to enhance the processing
capabilities.


CONCLUSION 

In this paper, we introduced the promising field of information extraction and gave a description of the
shallow parsing techniques used in our project SOKRATES. After presenting the overall architecture of our
system, we showed how the syntactic analysis is done with transducers and how feature structures are
used to describe and to store potential analysis results. We described how a simple German battlefield report
was analyzed and how the analysis result was represented in a feature structure.


Why this work?

Dr. M. Necking

Lt. Col. Baron Louis de Chantal:
Major limit of fusion is human, not technical. There is a real
need of fusion to free man from low level tasks.

Joachim Biermann:
We must improve information extraction and text
understanding for template-based information fusion.

Dr. Neil James Gordon:
information extraction from unstructured free text.

■ Information Extraction


Starting Situation

■ There are a lot of natural language texts (military reports,
emails, web sites, scientific reports, documents, ...) which
can't be evaluated due to a lack of specialists.

■ Which technical possibilities exist for automating the content
extraction?

→ Practical approach: Information Extraction (IE)

FGAN


Information Extraction:

■ Information extraction (IE) is the task of identifying,
collecting and normalizing information from natural language
text.

■ Relevant information about the Who, What, When, etc. is
looked for.

■ The information of interest is described through domain-specific
lexicon rules and patterns called templates.

■ During the IE task these templates are filled with the
collected information.

■ The templates are domain and task specific, i.e. for each new
task and domain they must be newly created.


Partial analysis of a military free-form report from our Bosnia
scenario:

„09. September 10.35 Uhr von MechInfBtl (US) in TUZLA:
10.20 Uhr 18 Fahrzeuge, davon 8 mit angehängten ZIS-3 und 1
mit angehängter T 12 marschieren bei Straßenkreuzung (CQ
072368) südlich MILESKIJ (CQ 0737) nach Norden.“

Annotated parts: when and who reports; who; where; what


Report

„... in TUZLA:
10.20 Uhr
18 Fahrzeuge, davon
8 mit angehängten
ZIS-3 und 1 mit
angehängter T 12
bei Straßenkreuzung
(CQ 072368) südlich
MILESKIJ (CQ 0737)
nach Norden.“

Formal representation of content

report
    addressee ...
    message
        move
            objects
                vehicles
                    name "???"
                    count 18
            area
                location
                    name crossing
                    coordinates cq072368


Formal representation

Use it

[Diagram labels: report, addressee ..., message, move, objects, vehicles, area, location; ATCCIS;
Analysis: who, when, where]


1st step: Lexical Analysis

■ Lexical analysis:

♦ subtasks:

> identify the part-of-speech (POS, e.g. 'marschieren' is
a verb)

> identify the morphological information (e.g. nom, pl)

> analyze compound nouns (e.g. Straßen-kreuzung)

♦ uses big German lexica (language knowledge)

♦ morphological information is recognized with finite-state
automata (transducers)


2nd Step: Recognition of Special Phrases

■ Recognition of special phrases:

♦ date/time expressions:

"18.12.1998", "Freitag, der achtzehnte Dezember 1998"
→ [type: date, year: 1998, month: 12, day: 18]

♦ entity name recognition (people, organizations,
companies, military units, places, ...) with lists or
finite-state automata (e.g. Kanzler Schröder, G. Schröder)

♦ number expressions, addresses, forms, ...
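
The numeric date form from the example above can be recognized with a regular expression, which is equivalent to a small finite-state automaton. This is a minimal Python sketch under stated assumptions: the names DATE_PATTERN and parse_numeric_date are hypothetical, only the day.month.year form is handled, and the spelled-out form ("Freitag, der achtzehnte Dezember 1998") would need additional lexicon-based rules.

```python
import re

# day.month.year with a four-digit year, e.g. "18.12.1998"
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

def parse_numeric_date(text):
    """Return a date feature structure for the first numeric date in text, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    return {"type": "date", "year": year, "month": month, "day": day}

date = parse_numeric_date("18.12.1998")
```

Both surface forms on the slide are meant to normalize to the same structure; this sketch covers only the first of them and performs no calendar validation.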



3rd Step: Recognition of General Phrases

■ Recognition of general phrases:

♦ nominal phrases:

"die erfolgreiche Verteidigung" → [head: verteidigung,
quant: def, mod: erfolgreich]

♦ prepositional phrases:

"für die erfolgreiche Verteidigung" → [head: für, comp: [
head: verteidigung, quant: def, mod: erfolgreich]]

♦ verb phrases:

> problem: discontinuous constituents
> example: "Gestern marschierten ab."


Technical Basis - I

■ Technical bases of the information extraction:

♦ extensive linguistic knowledge

> general and application-specific lexica

> general and/or application-specific phrase grammars

> general (and/or application-specific) clause and
sentence grammars

♦ cascaded transducers, i.e. finite-state automata that
read from the input and write to the output

♦ only application-relevant parts of the texts are analyzed
through transducers (shallow parsing techniques)


Research Institute for Communication, Information Processing, and Ergonomics


♦ Example of a transducer:

(compile-regexp
 '(:conc
   (:morphix-form "zwischen") (:star<=n (:seek ortsvorfeld vf1) 1)
   (:seek ortsname name1)
   (:morphix-form "und") (:star<=n (:seek ortsvorfeld vf2) 1)
   (:seek ortsname name2))
 :register-types '((:single-match vf1 vf2 name1 name2))
 :output-desc ...
 :name


feature structure

report    order    move    attack    ...    withdraw

report
    addressee    partner
    medium       medium
    message      {action}
    speaker      partner
    location     location
    credibility  credibility

move
    theme        theme
    source       location
    path         direction
    goal         location
    area         location
    distance     number
    speed        number
    carrier      object
    start-time   time
    duration     number


Project Sokrates - I

The research project Sokrates:

■ The overall objective of the Sokrates project is to analyze
written German battlefield reports.

■ The applicability of the information extraction technology in
the military domain will be evaluated.

■ Combines IE with the semantic analysis, in which ontologies
are used.

■ The result of the analysis is stored in the ATCCIS database.
These stored results can be used for different purposes.

■ One purpose is that location changes of units automatically
trigger changes of tactical symbols on the tactical map.


[Architecture diagram labels: Syntactic Preprocessing, Coordination Module, Semantic Analysis]


Project Sokrates - III

■ A first version of the system was realized.

■ The Sokrates system is able to process simple examples.

The next steps:

■ The functionality of the modules must be extended. The
system should be able to process more linguistically
interesting problems.

■ The lexica must be extended.


Screen Shots - I


Screen Shots - II

[Screenshot of COSMoS ver. 0.78. Menu: Datei, Bearbeiten, Hilfe (File, Edit, Help). Tabs: Eingabe, Trace,
Ausgabe, Veranschaulichung (Input, Trace, Output, Visualization). Panel: Eingabe Meldung (report input),
showing the following XML document:]


<?xml version="1.0" encoding="UTF-8"?>
<feature-structure>
  <feature name="type"><atomic>report</atomic></feature>
  <feature name="sender"><feature-structure>
    <feature name="type"><atomic>unit</atomic></feature>
    <feature name="name"><atomic>2./PzGrenBtl332-ZugC</atomic></feature>
  </feature-structure></feature>
  <feature name="addressee"><feature-structure>
    <feature name="type"><atomic>unit</atomic></feature>
    <feature name="name"><atomic>2./PzGren332</atomic></feature>
  </feature-structure></feature>
  <feature name="reporting_datetime"><ref number="2"><feature-structure>
    <feature name="type"><atomic>datetime</atomic></feature>
    <feature name="month"><atomic>10</atomic></feature>
    <feature name="day"><atomic>2</atomic></feature>
    <feature name="hour"><atomic>12</atomic></feature>
    <feature name="minute"><atomic>0</atomic></feature>
  </feature-structure></ref></feature>
  <feature name="reporting_data">
    <list>
      <feature-structure>
        <feature name="type"><atomic>move</atomic></feature>
        <feature name="agent"><feature-structure>
          <feature name="type"><atomic>unit</atomic></feature>
          <feature name="name"><atomic>6./PzMrs332-ZugB</atomic></feature>
        </feature-structure></feature>
        <feature name="direction"><feature-structure>
          <feature name="type"><atomic>location</atomic></feature>
          <feature name="name"><atomic>kr3</atomic></feature>
        </feature-structure></feature>
        <feature name="start_time"><ref number="2"/></feature>
      </feature-structure>
    </list></feature>
</feature-structure>


[Buttons: Meldung laden (load report), Meldung löschen (delete report)]


Feature Structures as XML Documents

■ The extracted content is passed to the coordination module as a
feature structure in XML format.

■ E.g.:

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Project SOCRATES - (FGAN) -->
<feature-structure xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="u:/Socrates/Dokumente/XML/Interface.xsd">
  <feature-structure>
    <feature name="LOCATION">
      <feature-structure>
        <feature name="NAME">
          <atomic>vinstedt</atomic>
        </feature>
        <feature name="TYPE">
          <atomic>LOCATION</atomic>
        </feature>
      </feature-structure>
    </feature>
    <feature name="REPORTING_DATETIME">


Used IE Tool

SMES:

■ Saarbrücker Message Extraction System, since 1997, G.
Neumann (DFKI), http://www.dfki.de/~neumann/pd-smes/pd-smes.html,

■ toolset for the construction of IE systems for the German
language,

■ components: scanner, morphological analysis, cascaded
chunk parser,

■ predefined grammars for the recognition of special and
general phrases,

■ lexica: > 120 000 word stems, > 25 000 verb frames,

■ in: Allegro Common Lisp.


Conclusion

■ An introduction to the promising field of information
extraction (IE) through an example was given.

■ A description of the different steps during the IE process was
presented.

■ Our research project Sokrates was described in detail.

■ We presented the tool used in our project Sokrates.